Conversation

@vchuravy
Member

Trying to fix #41 (comment)

@vchuravy
Member Author

This alone is not sufficient to resolve the data-race.

@vchuravy vchuravy requested a review from anicusan June 25, 2025 07:26
@vchuravy vchuravy marked this pull request as ready for review June 25, 2025 07:26
@vchuravy
Member Author

@anicusan I currently don't have an MWE that triggers anymore, but I believe this is closer to the spirit of what you intended to implement?

@giordano

This fixes #46 on AMDGPU MI300 as per #46 (comment)

@vchuravy vchuravy force-pushed the vc/unsafe_atomics branch from c43a4a3 to d4698ab July 1, 2025 06:53
@vchuravy vchuravy changed the base branch from main to vc/accumulate_alg July 1, 2025 06:53
@vchuravy vchuravy changed the title from "attempt to use UnsafeAtomics to fix race in accumulate" to "Use UnsafeAtomics to fix race in accumulate" Jul 1, 2025
@anicusan
Member

anicusan commented Jul 4, 2025

Apologies for the radio silence, I've been away for a conference. This is extremely useful, thank you for digging into this @vchuravy and @giordano. Two questions from my side:

  • Is UnsafeAtomics stable / will it be supported in KA in the future?
  • Is there any reason you hard-coded UInt8 as the flag type? It is the smallest and simplest, but if a user wants to supply some temporary buffer they have lying around, it could work with any integer, right?

@vchuravy
Member Author

vchuravy commented Jul 8, 2025

UnsafeAtomics is stable and will be supported by JuliaGPU (it's the underpinning of Atomix).

Is there any reason you hard-coded UInt8 as the flag type? It is the smallest and simplest, but if a user wants to supply some temporary buffer they have lying around, it could work with any integer, right?

Yeah, store! doesn't convert the flag to the eltype of the pointer. But I added that conversion into the code here.
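To make that concrete, here is a minimal host-side sketch of the idea (the helper names set_flag! and load_flag are illustrative, not identifiers from this PR): convert the flag value to the element type of the buffer before the atomic store, so any integer-typed flag buffer works, not just UInt8.

using UnsafeAtomics

# Illustrative helpers, not the PR's actual code: publish/read a per-block flag
# through a raw pointer, converting the value to the buffer's element type first.
function set_flag!(flags::Vector{T}, i::Integer, value) where {T<:Integer}
    GC.@preserve flags begin
        # release ordering: writes made before this store become visible to any
        # thread that later observes the flag with an acquire load
        UnsafeAtomics.store!(pointer(flags, i), convert(T, value), UnsafeAtomics.release)
    end
end

function load_flag(flags::Vector{T}, i::Integer) where {T<:Integer}
    GC.@preserve flags begin
        UnsafeAtomics.load(pointer(flags, i), UnsafeAtomics.acquire)
    end
end

# Works the same whether the scratch buffer is UInt8, UInt32, Int64, ...
flags = zeros(UInt32, 4)
set_flag!(flags, 1, 0x01)
load_flag(flags, 1) == 1   # true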

@vchuravy vchuravy requested a review from anicusan July 8, 2025 09:51
@christiangnrd
Member

christiangnrd commented Jul 8, 2025

Might be worth a rebase since there have been changes to the accumulate tests pushed to main.

Metal compilation fails because it only supports 32-bit integers (and floats) for these atomic operations. However, the DecoupledLookback tests failed when I changed the flag to use UInt32, so maybe we should reinstate ScanPrefixes as the default for that platform?

Finally, I ran this on a 3080 and got 1 failed test (test/accumulate.jl line 47, or 51 after rebase). I couldn't reproduce it until I increased the number of iterations for the "small block sizes -> many blocks" test to 10000, after which I get <10 failures.

@vchuravy
Member Author

vchuravy commented Jul 8, 2025

I would really prefer not to undo the Metal change without understanding why. The memory semantics ought to be the same, and as we have seen, more capable microarchitectures surface the latent bugs in this algorithm.

You are seeing failures at large enough problem sizes on CUDA as well?

@christiangnrd
Member

You are seeing failures at large enough problem sizes on CUDA as well?

I am. Also reproduces on the un-rebased version of this PR. Either more iterations or bigger arrays will increase the odds of triggering the bug.

@anicusan
Member

The Metal backend will not be able to support the DecoupledLookback algorithm - that was the primary reason for developing ScanPrefixes (issue / PR).

I realize I made a mistake when reading the Metal docs and unintentionally assumed more functionality than is guaranteed. threadgroup_barrier in MSL is effectively OpenCL C 1.2's barrier intrinsic, which only has a memory scope of Workgroup. This means (in the context of gpuweb/gpuweb#2229) that inter-workgroup communication cannot be supported correctly on all underlying platforms. (ref)

The decoupled lookback assumes that changes made to a device array (flags and v) by one workgroup will be visible to other workgroups. From the above discussions it seems that Metal does not offer such barriers at all.
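For context, here is a heavily simplified sketch of that flag/value handshake (illustrative Julia code mimicking the device-side protocol on the host; publish!, lookback, and the array names are hypothetical, not the kernel in this PR). A producer block writes its partial sum and only then sets its flag with release ordering; a consumer block spins on the flag with acquire ordering before reading the partial. This handshake is exactly the step that requires device-scope visibility between workgroups.

using UnsafeAtomics

const FLAG_READY = UInt8(1)

# Producer block: write the message (partial sum), then publish the flag with release ordering.
function publish!(partials::Vector{Float32}, flags::Vector{UInt8}, block::Int, value::Float32)
    partials[block] = value
    GC.@preserve flags begin
        UnsafeAtomics.store!(pointer(flags, block), FLAG_READY, UnsafeAtomics.release)
    end
    return nothing
end

# Consumer block: spin on the flag with acquire ordering; once it is observed,
# the partial sum written before the matching release store must be visible.
function lookback(partials::Vector{Float32}, flags::Vector{UInt8}, block::Int)
    GC.@preserve flags begin
        while UnsafeAtomics.load(pointer(flags, block), UnsafeAtomics.acquire) != FLAG_READY
        end
    end
    return partials[block]
end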

We have large-scale accumulate tests though - at least enough to saturate the SMs - and I have not been able to reproduce accumulate bugs on CUDA / AMD; @christiangnrd do you have an MWE?

@vchuravy on oneAPI we seem to have the following atomics error:

error: undefined reference to `_Z18__spirv_AtomicLoadPU3AS1cii'
in function: '__spirv_AtomicLoad(char AS1*, int, int)' called by kernel: 'gpu__accumulate_previous_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64> > >, NDRange<1, DynamicSize, StaticSize<_256__>, CartesianIndices<1, Tuple<OneTo<Int64> > >, void> >, _, oneDeviceArray<Int32, 1, 1>, _<UInt8, 1, 1>, Int32)'

Any ideas?

@vchuravy
Member Author

We widened the semantics on Metal recently to match the semantics across all backends. JuliaGPU/Metal.jl#609

So either this algorithm is sound everywhere or it is sound nowhere.

For oneAPI, @maleadt do you know off the top of your head which atomics are legal?

@anicusan
Member

Are you sure the Metal Shading Language supports that? See this post:

Decoupled look-back cannot run on Metal either, which is the underlying reason for the above. Also surprising, because I had imagined otherwise from reading the Metal Shading Language Specification. I believe that language will also be clarified.

@anicusan
Member

Further to the above, from the gpuweb/gpuweb#2297 discussion (where they had to change their scan because of Metal alone):

Metal's mem_fence_flags only affects the storage aspect of memory semantics, not the memory scope

it means Metal's threadgroup_barrier doesn't reliably support message passing between one threadgroup (workgroup) and another via device memory

about what's needed for message passing across workgroups (where the message can't be packed in a 32 bit word along with the flags), no, from what I can tell it can't be done in Metal at all (confirmed by testing and discussions with experts)

Also, in the MSL v4 Spec, Table 6.13. "Memory flag enumeration values for barrier functions" shows (bold mine):

mem_device | The flag ensures the GPU correctly orders the memory operations to device memory for threads in the threadgroup or simdgroup.

So your addition in JuliaGPU/Metal.jl#609 was correct anyway, but device memory writes are still only guaranteed to be visible within workgroups, not across them.

I can't really reduce the problem to a smaller MWE than the _accumulate_previous! function - decoupled lookback really is the MWE for Apple Metal not allowing message passing across workgroups (as shown in the post and GPUWeb discussion above).


What is left to do for this PR:

  • Add back the extension with alg=ScanPrefixes() for Metal only.
  • Make the atomic load/stores work on oneAPI.

@anicusan
Member

Related, @christiangnrd what did your benchmarks show regarding DecoupledLookback vs ScanPrefixes? The former has lower asymptotic complexity (for many blocks, so should be more scalable) and only two kernel launches, while the latter has three kernel launches but possibly a lower time constant. On a small benchmark with 1 million Float32 on Google Colab I see:

CUDA.jl accumulate:     207.351 μs
AK DecoupledLookback:   182.458 μs
AK ScanPrefixes:        101.358 μs
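For reference, roughly how such a comparison can be run with BenchmarkTools (a hedged sketch: the alg and init keywords on AK.accumulate! are assumptions taken from this thread, not a checked API):

using BenchmarkTools, CUDA
import AcceleratedKernels as AK

# Helper functions so each timed call synchronizes the GPU before returning.
run_base(x) = CUDA.@sync accumulate(+, x)
# `alg=` and `init=` keywords on AK.accumulate! are assumed from this thread.
run_ak!(y, alg) = CUDA.@sync AK.accumulate!(+, y; init=0.0f0, alg)

x = CUDA.rand(Float32, 1_000_000)

@btime run_base($x)                                              # CUDA.jl baseline
# accumulate! mutates its argument, so re-copy x before every evaluation
@btime run_ak!(y, AK.DecoupledLookback()) setup=(y = copy($x)) evals=1
@btime run_ak!(y, AK.ScanPrefixes()) setup=(y = copy($x)) evals=1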

@christiangnrd
Member

MWE:

using Test, CUDA
import AcceleratedKernels as AK

for i in 1:10000
    num_elems = rand(1:1_000_000)
    x = CuArray(rand(Int32(1):Int32(1000), num_elems))
    y = copy(x)
    AK.accumulate!(+, y; init=0)

    res = Array(y) .== accumulate(+, Array(x))

    passed = all(res)
    if !passed
        @info sum(.!res)
        # @info i,collect(1:num_elems)[.!res]
    end
    @test passed
end

@christiangnrd
Member

what did your benchmarks show regarding DecoupledLookback vs ScanPrefixes?

Algorithm              Float32 512k   Float32 3M    Int64 512k   Int64 3M
CUDA.jl accumulate     98.160 μs      421.993 μs    106.651 μs   447.884 μs
AK DecoupledLookback   89.881 μs      453.07 μs     119.161 μs   600.878 μs
AK ScanPrefixes        53.120 μs      264.760 μs    94.610 μs    454.240 μs

So the results are closer for Int64 than for Float32, but DecoupledLookback ranges from slightly better to much worse than the other two.

Run on a 3060.

@anicusan
Member

It seems ScanPrefixes is almost always faster than DecoupledLookback and the CUDA.jl base implementation - and it doesn't depend on fickle cross-workgroup message passing. I will merge this into vc/accumulate_alg, then keep the switch to ScanPrefixes for all platforms, and remove ext/Metal.

@anicusan anicusan merged commit 1b17354 into vc/accumulate_alg Jul 21, 2025
37 of 38 checks passed